02/01/2026
The task is formulated as a sequence-to-sequence style transfer problem, where an input sentence in normal Khmer is transformed into its corresponding royal-style sentence.
Let \(X\) be the input normal-style sentence, \(Y = (y_1, \dots, y_{T'})\) the target royal-style sentence of length \(T'\), \(\theta\) the model parameters, and \(\mathcal{D}\) the set of parallel training pairs.
Conditional Probability
\[ P(Y \mid X; \theta) = \prod_{t=1}^{T'} P\!\left(y_t \mid y_1, \dots, y_{t-1}, X; \theta\right) \]
Training Objective
\[ \theta^* = \arg\max_{\theta} \sum_{(X,Y)\in\mathcal{D}} \log P(Y \mid X; \theta) \]
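Under the factorization above, the log-likelihood of one pair \((X, Y)\) is the sum of per-step log-probabilities, and training maximizes that sum over \(\mathcal{D}\). A minimal sketch in plain Python (the per-step probabilities are made-up values for illustration, not model outputs):

```python
import math

# Hypothetical per-step probabilities P(y_t | y_<t, X) that a decoder
# might assign to one target sequence Y with T' = 4 characters.
step_probs = [0.9, 0.7, 0.8, 0.95]

# log P(Y | X) = sum_t log P(y_t | y_<t, X)  (log of the product above)
log_likelihood = sum(math.log(p) for p in step_probs)

# In practice the arg max over theta is implemented by minimizing the
# negative log-likelihood (cross-entropy) with gradient descent.
nll = -log_likelihood
```

Working in log space turns the product over \(T'\) steps into a sum, which avoids numerical underflow for long sequences.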
Notation
Cleaning Pipeline
Dataset Splitting
Characters are mapped to integer indices through a vocabulary (stoi); the special tokens <sos>, <eos>, <pad>, and <unk> are added; and batches are padded to a uniform length with pad_sequence.
Diagram: LSTM Autoencoder Architecture — flow from the input through the encoder into a lower-dimensional latent space, then through the decoder back to the output.
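The vocabulary, special-token, and padding steps can be sketched in plain Python (the toy corpus and helper names are hypothetical; a real pipeline would pad with `torch.nn.utils.rnn.pad_sequence`):

```python
# Character-level preprocessing mirroring the stoi vocabulary and the
# <sos>/<eos>/<pad>/<unk> handling described above.
SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]

def build_stoi(sentences):
    """Map each character (plus the special tokens) to an integer index."""
    chars = sorted({ch for s in sentences for ch in s})
    return {tok: i for i, tok in enumerate(SPECIALS + chars)}

def encode(sentence, stoi):
    """Wrap a sentence with <sos>/<eos>; unseen characters map to <unk>."""
    ids = [stoi.get(ch, stoi["<unk>"]) for ch in sentence]
    return [stoi["<sos>"]] + ids + [stoi["<eos>"]]

def pad_batch(seqs, pad_id):
    """Right-pad every sequence to the longest length in the batch."""
    width = max(len(s) for s in seqs)
    return [s + [pad_id] * (width - len(s)) for s in seqs]

corpus = ["abc", "ab"]            # stand-ins for Khmer sentences
stoi = build_stoi(corpus)
batch = pad_batch([encode(s, stoi) for s in corpus], stoi["<pad>"])
```

Placing `<pad>` at index 0 is a common convention so that padding positions can be masked out of the loss by ignoring index 0.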
Parallel Corpus
Data Structure Example
| Normal | Royal |
|---|---|
| លោកបានដើរកាត់តាមឆ្នេរសមុទ្រ | ព្រះអង្គស្តេចយាងកាត់តាមឆ្នេរសមុទ្រ |

Both sentences mean "he walked along the seashore"; the royal version replaces the ordinary pronoun and verb with the honorific pronoun ព្រះអង្គ and the royal verb ស្តេចយាង.
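One plausible in-memory layout for such a parallel corpus is a list of (normal, royal) pairs, split into source and target sides for the encoder and decoder (the structure below is an assumption for illustration; only the pair itself comes from the table):

```python
# Parallel corpus as (normal, royal) sentence pairs; the single pair
# below is the example from the table above.
pairs = [
    ("លោកបានដើរកាត់តាមឆ្នេរសមុទ្រ", "ព្រះអង្គស្តេចយាងកាត់តាមឆ្នេរសមុទ្រ"),
]

# Source sentences feed the encoder; target sentences feed the decoder.
normal_sentences = [src for src, _ in pairs]
royal_sentences = [tgt for _, tgt in pairs]
```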
Text Cleaning
Tokenization
Characters are mapped to indices with the vocabulary (stoi), and <sos> and <eos> are added to every sequence.
Attention Mechanism
The mechanism begins by calculating how well each encoder hidden state \(h_s^{enc}\) matches the current decoder state.
Alignment Score: Measures the relevance of encoder position \(s\) at decoding step \(t\): \[e_{t,s} = h_{t-1}^{dec} \cdot h_s^{enc}\]
Attention Weight: Normalizes scores into a probability distribution using Softmax: \[\alpha_{t,s} = \frac{\exp(e_{t,s})}{\sum_{k=1}^{T} \exp(e_{t,k})}\]
The model then aggregates the relevant information and predicts the next character.
Context Vector (\(c_t\)): A weighted sum of all encoder hidden states: \[c_t = \sum_{s=1}^{T} \alpha_{t,s} h_s^{enc}\]
Final Prediction: The decoder hidden state \(h_t^{dec}\) is updated with \(c_t\), and the next character is predicted: \[P(y_t \mid y_{<t}, X) = \text{Softmax}(W_{hy} h_t^{dec} + b_y)\]
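The three attention steps (dot-product alignment, softmax normalization, weighted sum) can be sketched with plain Python lists; the 2-dimensional toy hidden states below are hypothetical, not values from the model:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention(dec_state, enc_states):
    """Dot-product attention over a list of encoder hidden states."""
    # Alignment scores: e_{t,s} = h_{t-1}^{dec} . h_s^{enc}
    scores = [dot(dec_state, h) for h in enc_states]
    # Softmax -> attention weights alpha_{t,s} (max subtracted for stability)
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Context vector: c_t = sum_s alpha_{t,s} h_s^{enc}
    context = [sum(w * h[i] for w, h in zip(weights, enc_states))
               for i in range(len(enc_states[0]))]
    return weights, context

enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy encoder states
dec = [1.0, 0.0]                              # toy decoder state
weights, context = attention(dec, enc)
```

In a full decoder, \(c_t\) would then be combined with \(h_t^{dec}\) before the final softmax over the character vocabulary, as in the prediction equation above.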
Pre-training BLEU (test):
| Model | BLEU (%) |
|---|---|
| General Text | 30.1 |
| Folktale Text | 9.4 |
Fine-tuning (samples):
| Generated Output | Reference Output | BLEU |
|---|---|---|
| ព្រះមហាក្សត្រប្រទានព្រះរាជបន្ទូលថា បាន | ព្រះមហាក្សត្រប្រទានព្រះរាជទ្រព្យជួយរាស្ | 0.79 |
| ព្រះនាងមិនឱ្យមានព្រះរាជបុត្រពីព្រះនាងមា | ព្រះនាងមិនឱ្យភិលៀងធ្វើព្រះរាជកិច្ចជំនួស | 0.73 |
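Sentence-level BLEU scores like those above can be sketched as clipped n-gram precision combined with a brevity penalty. The simplified character-level variant below is an assumption for illustration, not necessarily the exact scorer used in these experiments:

```python
import math
from collections import Counter

def ngrams(seq, n):
    """Count the n-grams of a sequence (characters of a string here)."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    for n = 1..max_n, scaled by a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        if total == 0:
            return 0.0
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        if clipped == 0:
            return 0.0          # no smoothing: any zero precision -> 0
        precisions.append(clipped / total)
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

Identical strings score 1.0 and fully disjoint strings score 0.0; production scorers (e.g. NLTK's `sentence_bleu`) additionally apply smoothing so that a single missing n-gram order does not zero out the score.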